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Abstract: 

In  the  preceding  paper  [16j  we  presented  several  systolic  algorithms  for  Factorial  Data  Analy¬ 
sis  all  running  on  the  same  triangular  systolic  array  with  orthogonal  connections:SARDA(Systolic 
Array  for  Data  Analysis). We  restricted  our  study  to  matrix  computations. In  this  paper  will  now 
turn  ourselves  to  the  symmetric  eigenvalue  problem  on  SARDA  and  study  especially  tridiagonal- 
ization  followed  by  multissection  for  the  eigenvalues  and  inverse  iteration  method  for  the  associated 
eigenvectors. 
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1.  Introduction: 


Let  us  remind  our  purpose:we  wanted  to  devise  efficient  algorithms  for  Factorial  Data  Analysis 
all  running  on  the  same  triangular  systolic  array  SARDA.We  are  now  studying  the  synrunetric 
eigenvalue  problem  on  the  symmetric  positive  definite  matrix  W  of  size  k  (for  the  other  algorithms 
involved  see  part  I  [16]). 

As  we  have  a  systolic  array  of  size  s  with  usually  k  >  s,we  use  subspace  iteration  with  block¬ 
partitioning  (see  [5],[l5])and  with  these  two  techniques  we  can  restrict  our  study  to  the  symmetric 
eigenvalue  problem  on  W  of  size  «  and  obtain  the  i  largest  eigenvalues  of  W  with  t  <  s. 

ActuzJly  a  large  variety  of  systolic  algorithms  have  been  studied  for  the  symmetric  eigen¬ 
value  problem  [10].  We  can  distinguish  two  main  classes  of  methods:  the  first  is  the  Jacobi  method 
[l], [20], the  second  is  a  tridiagonalization  followed  either  by  QR  iteration  with  shift  [8], or  by  bissec- 
tion  [21].  No  method  actually  proposed  is  linear  in  time. 

Other  methods  are  possible  when  the  symmetric  matrix  W  is  positive  definite:  LR  Cholesky 
(chapter  8  of  [22])  which  is  an  adaptation  of  the  LR  algorithm  to  the  symmetric  case  on  the  matrix 
W  tridiagonalized  before, or  the  singular  value  decomposition  (SVD)  of  R  upper  triangular  matrix 
got  by  Cholesky  factorization  o{W(W  =  R^ R).{see  [15]) 

Let  us  now  describe  briefly  several  of  the  methods  listed  above  and  some  of  their  drawbacks. 

•  The  Jacobi  method  [1]  consists  of  applying  planar  rotations  on  the  rows  and  r  i  imns  of  W  to 
annihilate  its  non  diagonal  elements. This  algorithm  requires  a  square  array  of  <^^  4  processors 
with  octogonal  connections  (every  processor  has  eight  neighbors) .After  O(slogs)  time  steps 
we  obtain  the  eigen  elements  of  W'.This  method  is  quite  fast  but  it  shows  some  difficulties  to 
implement  because  at  every  time  step,every  processor  has  to  communicate  with  its  8  nearest 
neighbors. So  it  requires  a  progr2Lmmation  of  data  exchanges  quite  complex. Moreover  as  we 
have  no  diagonal  connections  between  processors  we  have  to  simulate  them  which  complicates 
the  programmation  and  delays  the  computations. 

•  The  QR  iteration  for  band  matrices  can  be  applied  to  the  symmetric  tridiagonalized  ma¬ 

trix  W’(see  [8], [9]). One  iteration  takes  about  2s  time  steps  but  without  shift  the  convergence 
is  slow. (The  non  diagonal  elements  =  2,...,s  of  converge  to  0  at  the  speed  of 

~  where  the  eigenvalues  A,,i  =  1, . . .  ,s  are  arranged  by  decreasing 

order).  The  problem  is  that  in  the  QR  iteration  you  need  to  have  the  kth  shift  to  compute 
the  kth  QR  factorization  Wj  -  o-*/  =  and  you  can  only  get  it  when  you  have  finished 

the  preceding  iteration  =  Rjc-iQu-i  at-iI.So  you  cannot  fit  the  iterations  into  each 

other  and  there  is  a  great  loss  of  time. 

•  The  LR  Cholesky  iteration  can  also  be  applied  to  the  symmetric  tridiagonal  matrix  W’  (see 

[22]). One  iteration  takes  about  2s  time  steps  and  the  iterations  can  fit  into  each  other  but 
the  convergence  is  very  slow  especially  when  several  eigenvalues  are  close  to  one  another. 
(The  non  diagonal  elements  =  2,...,s  of  converge  to  0  at  the  speed  of 

<i.*K-"i‘>  =  0((A,/A,-x)*/’)). 

•  We  did  not  use  the  methods  related  to  the  SVD  of  /?,the  Cholesky  factor  of  W  {W  = 
/?^/?), because  they  are  of  the  same  type  as  the  Jacobi  method  described  above.The  Hestenes 
method  consists  of  applying  planar  rotations  to  the  columns  of  A  to  orthogonalize  them 
([2], [19]). It  is  analogous  to  the  Jacobi  method  applied  to  A^R.The  SVD-Jacobi  method  ap¬ 
plies  once  more  planar  rotations  on  the  rows  and  columns  of  R  to  annihilate  its  non  diagonal 
elements  but  as  i!  is  not  symmetric  the  rotations  applied  to  the  rows  and  the  columns  of  R 
are  not  the  same  ([3], [14]). 


After  this  short  overview  of  the  methods  actually  available  for  the  symmetric  eigenvalue  prob¬ 
lem, (see  also  table  l),we  will  now  detaile  our  choice:  the  tridiagonalization  of  PV', described  in 
section  2, followed  by  the  multissection  method  .described  in  section  3, for  the  eigenvalues  of  W".In 
section  4  we  expleiin  how  to  get  the  eigenvectors  of  W  with  the  inverse  iteration  method. 

2.  Tridiagonalization  of  W'l 

Let  W  be  a  symmetric  matrix  of  size  s,we  have  to  define  a  unitary  matrix  P  such  that 
Wq  =  PW'P'^  is  tridiagonal. To  assure  the  numerical  stability  (see  [22])  the  matrix  P  can  be  either 
a  product  of  Householder  transformations  [11], or  a  product  of  rotations  [7],[8],[9].The  tridiagonal¬ 
ization  is  done  column  by  columnrwe  apply  series  of  projections  or  rotations  by  premultiplication 
to  annihilate  a  column  of  IV', then  by  postmultiplication  with  the  transpose  matrix  of  these  op- 
erators,we  annihilate  the  corresponding  row.So  we  get  a  matrix  unitary  similar  to  W  with  one 
column  and  row  annihilated  [17], [18]. This  method  requires  0(s*)  time  steps,a  triangular  array  of 
s(s  +  l)/2  processors  and  a  systolic  shifter  which  allows  to  transpose  matrices  without  delays.  We 
use  Givens  rotations  to  emnihilate  the  elements  of  W'. Their  utilization  is  up  to  date  especially  for 
the  QR  factorization  [7], [9]. 

Let  [lUt-i,  be  a  couple  of  elements  of  W'.In  order  to  annihilate  tuj  j  by  using  the  pivot 

Wi-ij  we  build  the  Givens  rotation  Ri  4  =  (  \  jggjjgj  ^jy. 

V  -  sin  6i  j  cos  9i  j  J  ■' 

P  =  [tw?-i,>  +  ,  cos  0ij  =  ,  sin  9ij  =  Wij/p. 

We  have  then  [u;,_i_y, u;,j]^  =  [p,0]^. 

Set  down  W  =  (tw,,,)i<,j<,.We  apply  (s  -  2)  rotations  by  premultiplication  to  the  (s  -  1)  last 
rows  of  W  in  order  to  annihilate  successively  . tus,!- 

u»j,i  is  annihilated  with  the  pivot  and  -R»,i[u>,_i,i, =  [wj_i_i,0]^. 

W-i.i  is  annihilated  with  the  pivot  and  i]^  =  [wj_2  i,0]^. 

After  the  (s  -  2)th  rotation  we  have  the  (s  -  2)  last  elements  of  the  first  column  of 
QJW  =  i?3,i . . .  W'  annihilated.  This  operation  is  called  the  first  “forward  pass”. We 

then  apply  these  (s  -  2)  rotations  by  postmultiplication  to  the  (s  -  1)  last  columns  of  QJW  and 
we  compute  Qf(QfW')^  =  Qf W'Qi.This  is  the  first  “backward  pass”. In  fact  we  only  compute  the 
product  of  QJ  with  the  (a  -  1)  l2ist  rows  of  {Q\W)^  because  we  already  know  that  the  first  row 
of  QiWQi  is  equal  to  the  first  column  of  Q^W'.And  we  get: 
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“'2.1 

0  . 
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“'2,2 

“'2,3  • 

“'I,* 
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“'3,3  • 

^  0 

<2 

■ 
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The  upper  indices  of  the  elements  of  Q^W'Qi  represent  the  number  of  rotations  applied  to 
the  elements  during  the  first  forward  and  backward  passes. 

Then  we  apply  (s  -  3)  rotations  by  premultiplication  to  the  (s  -  2)  last  rows  of  W*  =  QjW'Qi 
in  order  to  annihilate  successively  2,  «^J_i,2.  •  •  •  “  *'•1®  product  of  these  (s  -  3)  rota¬ 
tions  the  second  forward  pass  consists  of  Q\W^  and  the  second  backward  pass,of  = 

QlQlWQ^Q^. 


Finally  after  (a  -  2)  forward  and  backward  passes  we  get  the  symmetric,  tridiagonal  matrix 

W'o  =  Ql-iQLi  Ql^'Qi  Q.-^Q.-2 

All  these  unitary  transformations, products  of  Givens  rotations,are  applied  by  the  mean  of  a 
trianguleir  array  of  s(s+  l)/2  processors  with  orthogonal  connections  initially  conceived  by  Gentle¬ 
man  and  Kung  [7]  for  the  QR  factorization  of  a  dense  matrix. This  array  is  composed  of  cells  of  two 
types:  the  diagonal  cells  which  create  the  rotations  and  the  non  diagonal  cells  which  apply  them  to 
the  input  data  they  receive  (see  figures  3a  and  3b). The  matrix  is  introduced  into  the  array  column 
by  column ,esu:h  colunui  to  annihilate  entering  the  cell  which  creates  the  rotations,and  the  other 
columns  entering  the  other  cells  of  the  same  row. The  first  column  of  the  matrix  is  treated  by  the 
last  row  of  the  array  composed  of  s  processors  (number  of  non  treated  columns  of  the  array), the 
second  column  is  treated  by  the  next  to  last  row  composed  of  (s  —  1)  processors  and  so  on  until  the 
(s  —  2)th  column  treated  by  the  third  row  composed  of  three  processors. 

From  this  array  of  Gentleman  and  Kung,Schreiber  [17], [18]  deducts  an  array  with  the  same 
structure  which  realizes  the  tridiagonalization  of  a  symmetric  matrix. This  is  done  column  by  column 
as  we  described  it  above.The  last  row  of  the  array  is  composed  of  s  processors  (number  of  columns 
of  =  IV'),  the  far  right  one  constructs  the  rotations  and  the  others  apply  them. This  last  row 
computes  first  the  matrix  product  QJW  (see  figure  l).At  the  exit  of  the  row  the  matrix  Q\W'  is 
transposed  before  being  reintroduced  in  that  same  row  for  the  computation  of  (see 

figure  2). 

After  this  second  pass  in  the  last  row  of  the  array  IV*  =  QfWQi  is  introduced  into  the  next 
to  last  row  of  the  array  composed  of  (s  -  1)  processors  (number  of  non  treated  columns  of  tV*)  for 
the  second  forward  pass,  computation  of  IV*, then  after  transposition  of  (5^1V*,for  the  second 
backward  pass, computation  of  IV*  =  Q^W^Q^.Knd  so  on  until  the  (s  -  2)th  forward  and  backward 
passes  where  the  matrix  IV*"*  at  the  exit  of  the  third  row  of  the  array  will  be  tridiagonal. We  get 
the  tridiagonal  elements  of  PV*"*  in  the  diagonal  cells  where  they  are  memorized  when  we  apply 
the  forward  passes  (see  for  exeunple  figure  1).  Finally  we  have: 


PV*"*  = 


1 

...3»-7 


S*-7 

2(»-2) 


[tuf  i,ty24]  are  stored  in  processor  S)«  [w|,2>'^3,2]  stored  in  processor  a  —  l,e  —  1,.  • 
[iyj*'j[^,°_i, ti;**"^]  and  are  stored  in  processors  2,2  and  2,1. 
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Figure  1:  Moving  of  ihej[e  —  1)  Iwt  rows  of  W  in  the  array 
for  the  matrix  product  Q^W  (s  =  4). 
first  forward  pass:  Qf  =  Bt-i.ifit.i- 

The  duration  of  the  first  forward  pass  is  2s  time  steps. The  first  backward  pass  can  only  begin 
from  the  instant  f2s  +  1)T  when  we  have  computed  the  last  element  of  QfW'. 
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Figure  2:  Moving  of  the  (#  —  1)  last  rows  of  (Qf 
array  for  the  matrix  product  QiiQiW')^  (s  =  4). 
Srat  backward  paaa. 


The  duration  of  the  first  backward  pass  is  2(s  -  1)  time  steps.The  second  forward  pass  can 
begin  as  soon  as  we  get  the  first  elements  of  from  the  instant  (2s  +  3)T. 
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Figure  3:  (3a)  program  of  a  diagonal  celliThe  rotation  R 
which  annihilates  y  with  the  pivot  x  is  computed: 

(o)  "  ( 0  (y)  0=^^'  +  ,c  =  x/p, 

s  =  y/p.  Then  x  =  p  is  memorized  in  the  array. 

(3b]  program  of  a  non  diagonal  celhThe  rotation  R  is  apllied: 

Let  us  now  evaluate  the  duration  of  the  tridiagonalization.The  first  forward  pass  takes  2s  time 
steps. The  second  forward  pass  begins  at  the  instant  (2s  +  3)T  =  [2(s  +  1)  +  l]T  and  ends  at  the 
instant  [(2s+3)+2(s-l)-l]T  =  4sT'.The  third  forward  pass  begins  at  the  instant  [2(s+l)+2s+l]r 
and  ends  at  the  instant  [(4s  +  3)  +  2(s  —  2)  —  ijr. 

The  (s-2)th  forward  pass  begins  at  the  instant  [2(s+l)+2s+. .  .-|-2((s+l)-(s-2)-f  2)  +  l]T  = 
(2[(s  +  1)  +  s  +  . . .  +  5]  +  l)T,the  (s  -  2)th  forward  and  backward  passes  take  10  time  steps. So  the 
tridiagonalization  ends  at  the  instant  (2[(s  +  !)(»  +  2)/2  -  10]  +  10  +  l)r  =  [(s  +  l)(s  +  2)  -  9|r 
and  it  takes  ((s  +  l)(s  +  2)  -  9)  time  steps. 

Nevertheless  for  the  tridiagonalization  of  a  symmetric  matrix  of  size  s  on  his  triangular  array 
of  s(s  +  l)/2  cells  Schreiber  [17], [18]  assumes  the  existence  of  a  systolic  shifter  which  transposes  the 
matrices  between  each  forward  and  backward  passes  without  delay  time. It  is  possible  to  avoid  the 
use  of  this  shifter  by  creating  in  the  memories  of  each  non  diagonal  cell  internal  stacks  (LIFO  Last 
In  First  Out)  allowing  the  storage  of  intermediate  results  of  the  matrices  W*  between  the  forward 
and  backward  passes. Thus  we  create  three  internal  stacks  for  the  cells  t,t  —  1  for  •  =  3, . . .  ,s,one 
internal  stack  for  the  cells  i,  1  for  i  =  3, . . .  ,s  and  two  internal  stacks  for  all  the  other  non  diagonal 
cells. 

Then  the  computations  are  done  as  following;the  forward  passes  are  executed  as  described 
before, the  matrix  W'  is  introduced  vertically,column  by  column,  but  the  elements  of  each  column 
are  memorized  in  the  first  LIFO  of  each  non  diagoned  cell  a  —  i,j  called  F,-ij  instead  of  being 
sent  in  the  shifter  (see  figure  5). Except  the  elements  of  the  next  to  last  column  of  W*  which  are 
stored  in  F,-,,,-,-!  and  then  reintroduced  vertically  in  the  cell  s  -  i,8  -  i  -  l,the  other  columns 
j  of  W*  are  reintroduced  vertically  in  the  cells  s  -  t,  j  and  then  horizontally  to  the  right  neighbor 

6 


^  .Then  y  is  memorized  in  the  cell. 


cells  8  -  i,j  +  1  for  the  backward  pass  and  stored  in  the  second  LIFOs  from  where  they 

will  be  taken  out  for  the  next  forward  pass  (see  figure  6). These  second  LIFOs  are  linked 

up  directly  with  the  cells  a  -  i  -  l,j  of  the  (s  —  i  —  l)th  row  of  the  array  which  requires  a  slight 
modification  of  the  array  (see  figures  5-6-7). 

As  the  Systolimag  machine, (see  part  I  [16]), has  no  physical  diagonal  connections  we  can  sim¬ 
ulate  them  as  followingrsee  figures  4a,4b.We  can  consider  that  there  is  no  delay  time  when  we  use 
the  diagram  4b,  instead  of  the  diagram  4a  if  we  work  by  data  flow.In  fact  in  that  case  as  soon  as 
the  processor  i  —  l,j  receives  the  data  from  the  processor  i,j  it  sends  it  to  the  processor  i  —  1,  j  —  1 
without  delay  time. It  follows  that  the  backward  passes  are  not  executed  in  the  same  way  as  in  the 
array  of  Schreiber.In  his  model  the  rotations  computed  by  the  diagonal  cell  s  -  i,s  -  i  during  the 
(i  -|-  l)th  forward  pass  are  memorized  and  then  resent  in  the  (s  —  i)th  row  of  the  array  at  the  good 
time  to  execute  the  backward  pass. These  rotations  reach  every  processor  of  the  (s  —  i)th  row  one 
by  one  and  then  are  applied  to  the  rows  of  ■  In  our  model  the  rotations  sent  during  the 

forward  pass  along  the  (a  —  i)th  row  have  to  be  memorized  in  the  processors  from  their  first  utiliza- 
tion,the  jth  rotation  memorized  in  the  {j  +  l)th  processor  of  the  row  in  order  to  be  reused  during 
the  backward  pass. Without  using  systolic  shifter,the  times  to  execute  the  forward  and  backward 
passes  are  doubled. The  loss  of  time  comes  from  the  transposition  of  the  matrix  W*  between  every 
forward  and  backward  passes. 

The  first  forward  and  backward  passes  take  [2s  ■+■  2(s  —  1)]  time  steps. The  second  forw2u-d  and 
backward  passes  take  [2(s  -  1)  -f  2(s  -  2)|  time  steps. 

The  (s  -  2)th  forwzird  and  backward  passes  take  [2(s  -  (s  -  3))  -|-  2(s  -  (s  —  2))]  time  steps. So 
the  tridiagonalization  takes  (2s^  -  8)  time  steps. 


Figure  4:  (4a)  is  replaced  by  (4b). 
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Figure  5;  First  forward  pass:  computation  of  the  matrix 
product  Q^W  (8=4).  Qf  = 


The  duration  of  the  first  forward  pass  is  still  2s  time  steps.After  applying  the  rotations  R^  i 
and  the  columns  of  Q\W  are  stored  in  the  LIFOs  F,,,  ,  i  =  1, . . .  ,s  -  1.  Only  the  used 

connections  for  the  first  forward  pass  are  represented  by  continuous  lines. 
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Figure  6:  First  backw&rd  pass;  computation  of  the  matrix 
product  <5f(<3fW')^. 

The  first  backward  pass  begins  at  the  instant  (2s  +  l)r  after  the  computation  of  the  last 
element  of  QJW  and  takes  2(s  -  1)  time  steps.The  second  forward  pass  can  here  only  begin  at 
the  instant  (4s  -  1)T  after  the  computation  of  the  last  element  of  Qf  H^'Qi.Only  the  active  parts 
during  the  first  backward  pass  are  represented  by  continuous  line. 
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Figure  7:  Second  forward  pass;  computation  of  the  matrix 
product  QiW^.  Ql  = 

The  duration  of  the  second  forward  pass  is  2(*  -  1)  time  steps. 


3.  Multisection  method 


The  bisection  method  can  be  applied  to  tridiagonal  symmetric  matrices.lt  allows  to  determine 
all  the  eigenvalues  of  the  matrix  in  an  interval  fixed  in  advance. 

As  in  Factorial  Data  Analysis  the  eigenvalues  are  all  positive, we  can  determine  a  research 
interval  for  the  eigenvalues  from  the  beginning, for  example  [0,  ||iV'||i](||lV||i  =  maxy  E.  b.,>l)  We 
can  often  choose  [0,  IJ.This  method  relies  essentially  on  the  following  theorem  (see  [6]  ppl20-123).Set 
down: 
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We  define  the  Sturm  series  of  characteristic  polynomials  of  the  dominant  submatrices  of  size  j  of 
Wo:po(A)  =  l,Pi(A)  =  (61  -  A),  . .  .,Pj(A)  =  (hj  -  A)py_i(A)  -  c^_jPy_2(A).  Set  down  1  <  j  <  s  and 
p  G  /?,if  we  define: 

_  /  sign(Pj(M)),  if  Pj(p)  7^  0 
\  stpn(py_i(p)),  ifpy(p)=0 
then  the  number  N{j\yi)  of  sign  changes  of  the  set  E{j,n)  =  {syfipo(p),  s^npi(p), . . . ,  spnpy(p)}  is 
equal  to  the  number  of  roots  of  the  polynomial  py  which  are  strictly  smaller  than  p. 
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3.1.  description  of  the  method: 

It  is  based  on  the  ideas  of  Schreiber  [21]. Suppose  we  are  using  (s  +  1)  processors ,th at  is  to 
say  for  example,the  last  row  of  our  triangular  array  plus  one  supplementary  processor. The  s  first 
compute  the  series  (p,(p))  for  1  =  l,...,s,p  €  R  and  the  last,stores  the  numbers  N{a\y)  for 
different  values  of  p.The  array  works  like  this:we  introduce  a  real  p  into  the  first  processor  s,  1 
which  computes  pi(p)  and  A''(l;p)  and  sends  its  results  to  the  second  processor  s,2  which,at  its 
turn,computes  P2(p)  and  W(2;p)  and  so  on  until  the  sth  processor  s,s  which  computes  p#(p)  and 
N{s\  p). Every  time  we  send  a  real  p  in  the  row  of  (s  +  1)  processors, we  have  to  wait  s  time  steps 
to  get  A^(s;p).The  goal  is  then  to  use  as  best  as  possible  every  processor  at  every  time  step, and 
for  that,to  choose  judiciously  the  vedues  p  to  introduce  into  the  array  in  such  a  way  that  every 
processor  is  as  active  as  possible  during  all  the  course  of  the  method. In  the  other  way  we  have 
to  choose  the  multisection  method  sis  efficient  as  possible,that  is  to  say,the  one  requiring  sis  less 
arithmetic  operations  as  possible. In  order  to  do  that  let  us  compare  different  methods:  assume  we 
have  Ai  <  A2  <  •  •  •  <  A,  contained  in  [ao,6o]  and  that  we  search  the  ith  eigenvalue  Aj. 

1.  bisection  method:  we  set  cq  =  (oq  +  6o)/2  and  we  compute  iV(s;co),if  N[s\cq)  >  i  then 
Aj  G  [ao,co],we  set  oi  =  oq  and  bi  —  co.else  A,-  G  [co,fro]  and  we  set  oi  =  cq  and  bi  =  bo.We  set 
Cl  —  (ai  +  bi)/2  and  we  compute  iV(s;  ci),. .  .Thus  we  determine  a  series  of  intervals  fitting  into 
esich  other  [at,6jk],A:  G  N  such  that  Aj  G  [a*, 6*]  and  b^  -  ak  =  {bo  -  ao)/2*. 

2.  multisection  of  order  p:  we  set  cy  =  oq  +  j{bo  -  ao)/p  and  we  compute  N{s-,Cj)  for 

j  =  0,  ...,p—  l.If  N{a;Cj)  -  N(a;Cj^i)  =  1  then  A,  G  [cy_i,cy|,we  set  Oj  =  cy_i,6j  =  cy  and 
Cy  =  a'l  +  j{b'i  -  a'i)/p,and  we  compute  N(s;  cy)  for  j  =  0, . . .  ,p  -  1,. .  .Thus  we  determine  a  series 
of  intervals  fitting  into  each  other  [aj,6fc]>^  ^  ^  s'^ch  that  Aj  G  and  6^  -  =  (bo  -  ao)/p*. 

Assume  we  want  to  get  the  eigenvalue  A,  with  a  precision  PREC,ii  at  the  kth  step  we  set 
LB\i]  =  (a*  +  6i)/2  we  get  A,  with  a  precision  of  (o*  —  6t)/2.For  the  method  1  we  will  need  to 
do  k  computations  of  A^(s;c,)  with  k  =  [ln((f>o  -  ao)/2PREC)/  ]ii2\  and  for  the  method  2,p  x  k 
computations  of  IV(s;c,)  with  k  =  [ln((6o  -  oo)/2Pi?i?C)/lnp].The  efficiency  of  the  multisection 
method  of  order  p  related  to  the  bisection  method  is  then:  Ep  =  0(log2p/p)  [l3].The  optimal 


number  p*  will  have  then  to  be  eis  small  as  possible  and  also  optimize  the  output  of  every  processor 
(idle  times  minimized). 

Let  us  look  now  at  the  parallelized  method  principle. We  are  only  interested  in  Factorial 
Data  AnJilysis  in  the  greatest  eigenvalues, for  example  those  greater  than  a  bound  Xq  fixed  in 
advance. We  then  cut  the  interval  [-X^o.  ||^*||i]  into  (s  +  2)  pieces  of  equzd  length  po  =  (||W'||i  - 
Xo)/(8  +  2)  and  we  introduce  successively  the  reals  X°  =  Xo  +  «po  at  the  instants  (i  +  1)7’  for 
1=0,...,  s  +  2  into  the  eirray.lt  then  computes  X(s;  X°)  for  i  —  0, ,  s  +  2  and  WB[i]  =  N(s;  X°)  - 
IV(s;X°_j)  for  I  =  1, . . . ,  s  +  2.  At  the  instant  sT  we  get  A^(s;Xo),we  know  then  that  there  are 
k  :=  8  —  N(s;Xq)  eigenvalues  contained  in  [Xq,  (]W'||i].When  the  last  X^  is  introduced  into  the 
array  the  first  difference  JVS[l|  is  computed;we  can  then,  from  this  instant, define  new  values  A,* 
to  introduce  into  the  array  for  the  second  sweep  (if  iVB[l]  >  l).We  choose,as  for  the  first  sweep, 
and  for  all  the  next  sweeps,a  set  of  about  (s  +  2)  values  Xl  such  that  they  2u:e  as  less  interruptions 
as  possible  of  the  activity  of  the  processors  between  two  sweeps.  We  will  have  at  most  k  subinter¬ 
vals  containing  eigenvalues ;we  thus  choose  to  cut  each  of  them  into  /  equal  parts  with 

/  ;=  [(s  +  2)/fc].These  1  parts  will  be  of  length  pi  —  (||W'j|i  -  Xo)/(s  +  2)1. 

We  introduce  successively  the  Xj  —  +  jpi  for  «  such  that  IVS[i]  >  0  and  for  j  —  1, . . 

/  —  l,into  the  array  which  computes  the  N(s;  Xj)  as  well  as  N B1  :=  iV(s;  Xj)  —  N{s\  Xj_i). After  this 
second  sweep  we  determine  subintervals  of  length  (||W'||i  -  Xo)/(s  +  2)/  containing  eigenvalues. And 
so  on  for  the  following  sweeps:we  set  p,  =  p,_i//  and  we  cut  the  subintervals  containing  the  eigen¬ 
values  into  /  equeil  parts  of  length  p^. After  K  sweeps  the  subintervals  containing  the 

eigenvalues  will  be  of  length  (||kF'||i  —  .X^o)/(s  +  2)/^“^. By  setting  down 

Z/B[t]  =  set  eigenvedue  with  a  precision  of 

(lllV'||i-Xo)/2(s  +  2)/^-i'. 

3.2.  Number  of  arithmetic  operations: 

The  computation  of  iV(s;  p)  requires;  s  integer  additions,(2s  - 1)  real  soustractions,(45  -  5)  real 
multiplications  and  s  sign  comparisons.  The  first  sweep  requires  (s  +  2)  computations  of  N{s-,fi) 
and  the  following  sweeps,at  most  I  x  k  computations  of  N{s-,  p)  where  /  x  ~  s  +  2. So  K  sweeps 
will  require  about  /f(s  +  2)  computations  of  W(s;p).  To  get  the  eigenvalues  of  W'  with  a  precision 
PREC  we  will  have  to  take  K  —  [ln((ljW'||i  -  Xo)/2kPREC)/\n{s  +  2)/fc],Emd  so  to  execute 
0(s(s  +  2)/ln(s  +  2))  arithmetic  operations. 

3.3.  Duration  of  the  algorithm: 

The  first  sweep  requires  2(s  +  2)  time  steps,(s  +  2)  time  steps  for  the  introduction  of  the  X? 
into  the  array, and  (s  +  2)  for  the  computation  of  the  IV(s;X°).The  following  sweeps  require  k  x  I 
time  steps,  that  is  to  say  about  (s  +  2)  time  steps  for  the  computation  of  the  W(s;Af^).So  our 
method  requires  (if  -  1)  x  fc  x  /  +  2(s  +  2)  ~  (if  +  l)(s  +  2)  time  steps. Finally  our  multisection 
method  can  be  executed  in  0((s  +  2)/ln(5  +  2))  time  steps  with  0(s(s  +  2)/ln(s  +  2))  arithmetic 
operations. 

The  figure  8  shows  us  the  array  for  our  method, as  well  eis  the  sending  of  the  data  Af°  for  the 
first  sweep. The  processor  8,j  for  2<  j  <  8  computes  the  jth  char2icteristic  polynomial  Pj(X°)  for 
I  =  0, ...,«  +  2  from  the  characteristic  polynomieils  pj-i(Xj')  and  P;-2(X°)  and  the  elements  cy-i 
and  bj  of  the  matrix  Wq  {Wq  is  the  matrix  W  tridiagonalized);cy_i  and  bj  are  memorized  in  the 
cell  for  other  computations  of  characteristic  polynomials  Pj{X^)  for  m  =  1,2,...  The  processor 
a,j  also  updates  the  variable  NNCS  which  computes  the  number  of  sign  changes  of  the  series 
{pj{Xf))J  =  0, . . .  ,s.The  last  processor  s,s  +  1  computes  the  =  iV(s; X?)  -  iV(s;X?_j)  for 

«  =  1,. . .  ,8  +  2  and  determines  the  intervals  [X?_i,X°]  containing  eigenvalues.lt  fixes  then  a  new 
set  of  values  Xj  =  X°_j  +  jpi  for  j  =  1, . . . ,  /  -  1  to  test  during  the  second  sweep. 


4.  Inverse  iteration  method: 

After  the  tridiagonalization  of  W  into  Wq  with  the  product  of  Givens  rotations  P: 
Wq  —  PiV'P^.and  after  the  computation  of  the  eigenvalues  of  W  with  the  multisection  method  of 
order  (s  +  2)  on  iVo,it  remains  to  determine  the  eigenvectors  of  W  . 

We  use  the  inverse  iteration  method  [4]  which  requires  to  know  quite  accurate  approximations 
of  the  eigenvalues  to  get  a  fast  convergence. Assume  a  is  an  approximation  of  the  eigenvdue  A, then 
to  compute  the  eigenvector  x  associated  with  A, we  build  the  series  (<lk)keN  of  unitary  vectors 
converging  on  i  and  defined  by; 

•  90  =  «/ll«li2- 

•  resolution  of  {W  —  al)zk  —  qk-i  ,  k  =  1,2, .. . 

•  9fc  =  Zk/W^kh- 

It  is  the  power  method  applied  to  {W'  —  <T/)“^,the  convergence  factor  is  then: 
|A-ff||/ maxA^A.(|At  -We  apply  here  this  method  to  the  tridiagonalized  matrix  Wq  and  we  use 
as  eigenvalues  approximations, the  values  computed  by  the  multisection  method  .  The  tridiagonal 
system  (Wq  -  al)zk  =  9t-i  can  be  solved  by  LU  feictorization  or  by  QR  factorization  .As  a  is  very 
close  to  A  the  matrix  (Wq  -  ff/)  is  almost  singular, and  we  can  not  use  the  LV  factorization  without 
partial  pivoting  for  stability  and  pivoting  does  not  pareJlelize  easily. 

So  first  we  determine  a  product  of  (s  —  1)  Givens  rotations  such  that  Q^{Wo  —  al)  =  R 
is  upper  tri2mgular  and  we  compute  =  Q^qk-i  and  then,we  solve  the  triangular  system 

Rzk  =  The  QR  factorization  is  not  done  column  by  column  from  the  triangular  array  of 

s(s  +  l)/2  processors  of  Gentleman  and  Rung  [7], but  diagonal  by  diagonzil  from  the  array  of  Heller 
and  Ipsen  [9]  for  bauid  matrices. 

In  this  array  composed  of  u;  x  ^  processors  where  w  is  the  band  width  of  the  matrix  and  9,the 
number  of  subdiagonals. The  matrix  is  introduced  diagonal  by  diagonal  and  the  QR  factorization 
is  done  in  2(s  +  {9  -  1))  time  steps  (instead  of  3s  for  the  array  of  Gentleman  and  Rung). Every  row 
of  the  array  is  composed  of  one  cell  which  creates  the  rotations  and  that  the  diagonal  of  elements 
to  be  annihilated  crosses  and  of  (u;  -  1)  cells  which  apply  and  broadcast  the  rotations  to  the  other 
diagonals. The  computation  of  the  right  hand  member  uk-i  =  Q^qk-i  can  be  done  on  the  same 
array  if  we  add  q  supplementary  cells. 

As  (Wq  -  al)  is  tridiagonal ,ti;  =  3  and  q  =  l.So  we  will  only  need  4  processors, 3  for  the 
computation  of  Q^{Wq  —  <rl)  =  R  (1)  and  one  for  the  computation  of  Q^qk-i  —  ut-i  (2)  R  will 
have  then  2  upper  diagonals  and  the  band  trieingular  system  Rzk  =  tit-i  (3)  will  be  solved  by  3 
processors  linearly  connected  as  described  by  Rung  and  Leiserson  [12]  .For  the  operations  (1)  and 
(2)  we  can  use  for  example  the  fourth  row  of  our  tiangular  array  composed  of  4  processors, and 
for  the  operation  (3), the  third  row.  The  figure  9  represents  the  QR  factorization  of  (Wq  -  <t/). 
(fVo  -  al)  =  (c,_i,(6.  -  a),  as  well  as  qk-i  =  (9i)i<t<»  introduced  into  the  fourth 

row  of  the  array  and  the  matrix  R  =  as  well  as  the  right  hand  side 

u;^_i  =  are  stored  into  internal  stacks  (LIFO)  of  the  four  processors. The  operation  (3) 

described  in  the  figure  10  is  executed  from  the  data  from  the  LIFOs.  As  we  get  the  last  value  u«  at 
the  instant  (2s  +  2)T,the  resolution  of  the  triangular  system  Rzk  =  Uk~i  car,  begin  at  the  instemt 
(2s  +  3)T  and  we  get  z*  after  (2s  +  1)  time  steps. 

The  speed  of  convergence  of  the  series  {qk)keN  to  the  eigenvector  depends  on  the  accuracy  of 
the  eigenvalue. In  fact  if  <r  is  such  that  {W  —  al  +  E)  is  singular  with  ||E||2  =  e  then  it  exists  at 
least  one  vector  q  such  that  the  solution  z  of  {W  -  al)z  =  q  determines  an  eigenvector  z  of  W 
exact  at  the  precision  e  (that  is  to  say  such  that  z  is  an  eigenvector  of  W  +  F  with  ||F||2  =  0(€);8ee 
p  173  of  [4]). So  we  can  get  from  the  approximation  a  of  A, with  one  step  of  the  inverse  iteration 
method, an  approximation  of  the  eigenvector  of  the  same  order  of  accuracy  in  (4s  +  3)  time  steps. 
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We  can  reduce  this  execution  time  to  an  average  of  2s  time  steps  per  eigenvector  by  adding  a 
LIFO  to  each  of  the  four  cells  of  the  fourth  row. Another  resolution  of  tridiagonal  system 
[Wq  -  a'I)z\  =  9o  can  then  begin  at  the  instant  2sT',the  new  matrix  R'  =  Q'  (Wq  -  a'  1)  and 
the  new  right  hand  member  uo  =  Q'^<io  being  stored  in  the  new  LIFOs. 

If  qi  is  an  eigenvector  of  Wo.then  P^qi  is  an  eigenvector  of  W'.lt  remains  then  to  compute  this 
matrix- vector  product. We  can  get  the  matrix  P  by  introducing  the  identity  matrix  I,  just  after  the 
matrix  W  during  its  tridiagonalization  (from  the  instant  (s  -t-  l)T:see  figures  5-6-7)  and  by  only 
applying  the  forward  peisses  to  I,  (resending  of  the  rotations  memorized  in  the  diagonal  processors 
along  the  rows  from  the  right  to  the  left). The  product  P^qi  can  be  done  in  2s  time  steps  with  s 
processors  linearly  connected. We  have  only  to  introduce  the  matrix  P  row  by  row  and  the  vector 
q  coordinate  by  coordinate  (the  ith  row  and  the  iih  coordinate  entering  the  ith  processor). 

5.  Conclusion: 

The  study  of  systolic  algorithms  for  matrix  computations  on  our  triangular  systolic  array 
SARDA  has  showed  us  the  possibility  to  use  a  systolic  array  for  a  great  number  of  systolic  algorithms 
without  having  to  modify  the  interconnection  structure  between  the  different  operations. In  fact  the 
processors  have  only  to  have  a  memory  large  enough  to  allow, during  the  programmation  of  the 
nodes, the  creation  of  a  great  number  of  processes  and  internal  or  external  FIFOs  and  LIFOs. 

On  the  other  hand  the  asynchronous  programmation  of  the  algorithms  has  allowed  us  to 
simplify  the  supervision  operations  of  the  host  system  and  to  save  time.  These  operations  are  now 
reduced  to  sending  the  data  and  receiving  the  results. Nevertheless  we  have  to  replace  the  global 
synchronization  of  the  system  by  local  controls  in  each  cell  such  as  tests  or  loops  counters  (see  [15]). 

However  this  study  has  above  all  pointed  out  the  necessity  of  adding  to  the  systolic  array 
another  computational  system  to  process  the  data  efficiently,  that  is  to  8ay,receive  the  results 
and  store  the  intermediate  computations  such  that  the  systolic  array  be  as  active  as  possible. In 
particular  the  system  should  be  able  to  store  automatically  the  matrices  by  rows,columns  or  blocks 
and  to  transpose  the  matrices  asynchronously.In  that  way  the  sending  of  data  and  receiving  of 
results  could  be  done  asynchronously  m  the  operations  inside  the  array. 

On  page  18  a  summary  table  describes  the  algorithms  we  have  studied  for  the  symmetric 
eigenvalue  problem. First, to  determine  the  eigenvalues  (Jacobi  or  tridiagonalization  -|-  multisec¬ 
tion), second, to  determine  the  eigenvectors  (inverse  iteration  method). 
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Figure  9;  QR  factorization  of  {Wq  -  al)\  computation 
of  Q  such  that  Q^{Wo  —  ffl)  =  R  and  computation  of 
u*_i  =  Q'^qk-i  (»  =  4). 

These  computations  require  (2s  +  2)  time  steps. 
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algorithm 

shape  and  connections 
of  the  array 

number  of  time  steps 

Jacobi  method  on  a 
symmetric  matrix  of 
size  a  X  a 

square  array  of  fl*/4 
processors  with  octogo- 
nal  connections 

3a  log  a  time  steps 

tridiagonalization  of  a 
symmetric  matrix  of 
size  a  X  a 

triangular  array  of 
«(s  -1-  l)/2  processors 
with  orthogonal  con¬ 
nections  +  a  systolic 
shifter  [fig  1-3] 

(a*  4-  3a  -  7)  time  steps 

triangular  array  of 
s(s  +  l)/2  processors 
with  horizontal  and 
diagonal  connections 
and  internal  LIFOs  [fig 
4-6] 

(2s^  —  8)  time  steps 

LR-Cholesky  method 
on  a  tridiagonal  matrix 
of  size  a  X  a 

triangular  array  of 
a{a  -f  l)/2  proces¬ 
sors  with  orthogonal 
connections 

very  slow  convergence. 
2s  time  steps  per 
iteration 

multisection  method 
of  order  («  4-  2)  on  a 
tridiagonal  matrix  of 
size  a  X  a 

(a  -b  2)  processors 
linearly  connected  [fig 

8i 

0((s  4-  2)/log(s  4-  2)) 
time  steps 

inverse  iteration 
method  on  a  tridi¬ 
agonal  matrix  of  size 
a  X  a 

4  processors  linearly 
connected  with  internal 
LIFOs  +  3  processors 
linearly  connected  and 
diagonaly  connected 
[fig  9-10] 

average  of  48  time 
steps  per  eigenvector 
(and  per  iteration) 

singular  value  decom¬ 
position  (SVD)  of  the 
Cholesky  factor  of  W 
of  size  a  X  a 

a/2  processors  linearly 
connected  with  2 

FIFOs  per  processor 

3s  time  steps  4- 
0{a[a  -  l)loga)  time 
steps 

square  array  of  a^/4 
processors  with  octogo- 
nal  connections 

38  time  steps  4- 
0(slog8)  time  steps 

authors  and  special 
features 

Brent-Luk,  Schreiber 
[l],[20]:  utilization  of 
cyclic  permutations 

Schreiber  [17], [18]: 
is  done  column  by 
column  and  row  by  row 


Porta*  [15]:  is  done 
column  by  colunrui  and 
row  by  row 


Schreiber, Porta*  [19] , 
[13] 


HelIer-Ipsen,Kung- 
Leiserson , Porta* 
[91.[12|.[15] 


Schreiber, Brent-Luk 
[19], [1]  Cholesky  fac¬ 
torization  -I-  Hestenes 
method 

Brent-Luk- Van  Loan, 
Luk  [3], [14] '.Cholesky 
factorization  -I-  SVD- 
Jacobi  method 


Table  1:  Eigenelements  computations 
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